Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent StepExecutionIterator from leaking memory in cases where a single processed execution has a stuck CPS VM thread #347

Merged
merged 6 commits into from
Aug 9, 2024

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Aug 6, 2024

I recently saw a real world case (ref. SECO-3962) where a Pipeline build’s flow graph had a loop due to some type of corruption (I suspect a crash while the Pipeline was running lead to a build.xml whose iota and head were older than the most recent FlowNode in workflow/ and its StepExecution in program.dat), which caused its CpsVmExecutorService to get stuck in an infinite loop on a single task. This caused all calls to StepExecution.applyAll to leak all live WorkflowRuns at the time of the call because StepExecutions for those builds were strongly reachable from the CPS VM thread for the stuck build via the SingleLaneExecutorService task queue for the CpsVmExecutorService and the internals of Guava’s Futures.allAsList.

Here is the leak path (produced via MemoryAssert)
[Ljava.lang.Thread;@52454457-[38]->
java.lang.Thread@6dbe3b4f-threadLocals->
java.lang.ThreadLocal$ThreadLocalMap@3b25709e-table->
[Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;@32a29361-[13]->
java.lang.ThreadLocal$ThreadLocalMap$Entry@cdf3ee2-value->
// This CpsVmExecutorService is for the stuck build.
org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$$Lambda$775/0x000000030078f4d0@3b60efd8-arg$1->
org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService@50b736a0-base->
hudson.remoting.SingleLaneExecutorService@72e09a66-tasks->
java.util.concurrent.LinkedBlockingQueue@541e333d-last->
java.util.concurrent.LinkedBlockingQueue$Node@696552a4-item->
java.util.concurrent.FutureTask@1343f77d-callable->
java.util.concurrent.Executors$RunnableAdapter@774ecd5a-task->
org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$$Lambda$772/0x000000030078afc8@6d14d82b-arg$3->
org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1@6dd1f638-this$1->
org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4@2cbc5084-val$callback->
org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5@2b4abaaa-val$r->
// These references through futures are from the usage of `Futures.allAsList` in `FlowExecutionList$StepExecutionIteratorImpl.applyAll`
com.google.common.util.concurrent.SettableFuture@371f2f2a-listeners->
com.google.common.util.concurrent.AbstractFuture$Listener@7effd655-task->
com.google.common.util.concurrent.AggregateFuture$$Lambda$815/0x00000003007eb4b8@8f2f390-arg$1->
com.google.common.util.concurrent.CollectionFuture$ListFuture@752edffc-values->
java.util.ArrayList@741fc2e0-elementData->
[Ljava.lang.Object;@460cb18f-[1]->
com.google.common.util.concurrent.CollectionFuture$Present@14a432cd-value->
java.util.Collections$UnmodifiableRandomAccessList@72240090-list->
java.util.ArrayList@53dece06-elementData->
[Ljava.lang.Object;@67e46851-[0]->
// This step and its run are from a completed build that is _not_ the stuck build, which causes the memory leak.
org.jenkinsci.plugins.workflow.test.steps.SemaphoreStep$Execution@5664556b-context->
org.jenkinsci.plugins.workflow.cps.CpsStepContext@2cd3b80f-executionRef->
org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner@4d626ff-run->
org.jenkinsci.plugins.workflow.job.WorkflowRun@43a80df5}

There are a lot of things that could be improved here, but I am not sure where our effort is best spent given that I have not previously seen corruption cause an infinite loop like this:

  1. This PR: Change the result type of StepExecutionIteratorImpl.applyAll to ListenableFuture<List<Void>> instead of ListenableFuture<List<StepExecution>>. The result type in the API is ListenableFuture<?>, and the return value is only intended to be used to track progress and cancel the operation if needed, so this change should be compatible. This change means the combined future drops its references to each build's StepExecutions as soon as they have been processed, rather than having to hold a reference to them until all build's StepExecutions have been processed.
  2. We could add infinite loop detection to all Pipeline flow graph iteration constructs, e.g. LinearBlockHoppingScanner, LinearScanner, DepthFirstScanner, ForkScanner, StandardGraphLookupView.bruteForScanForEnd, etc., and throw an exception whenever any infinite loop is encountered.
  3. We could add monitoring to CpsVmExecutorService to warn when a build’s SingleLaneExecutorService seems to be having problems, e.g. the queue size is only increasing, the executing task has been live for more than 15 minutes, tasks are starting more than 15 minutes after they were submitted, etc.
  4. We could add an admin monitor and/or support bundle components that report when a build has been live for more than (for example) 7 days, so that admins are aware that something might be wrong and they can try to kill the build (at the time I investigated the problematic build, it was around 3.5 months old).
  5. Similarly, we could change Pipeline resumption logic to not attempt to resume any build started more than (for example) 14 days ago (at the time I investigated the build, it had tried to resume 16 times over the 3.5 month period).

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@dwnusbaum dwnusbaum added bug tests and removed bug labels Aug 6, 2024
…<List<Void>> instead of Future<List<StepExecution>>
@dwnusbaum dwnusbaum changed the title Add ignored test for corrupt flow graph with a loop that lead to a memory leak Avoid memory leak by changing return value of StepExecutionIteratorImpl$applyAll to Future<List<Void>> instead of Future<List<StepExecution>>` Aug 7, 2024
@dwnusbaum dwnusbaum changed the title Avoid memory leak by changing return value of StepExecutionIteratorImpl$applyAll to Future<List<Void>> instead of Future<List<StepExecution>>` Avoid memory leak by changing return value of StepExecutionIteratorImpl$applyAll to Future<List<Void>> instead of Future<List<StepExecution>> Aug 7, 2024
@dwnusbaum dwnusbaum changed the title Avoid memory leak by changing return value of StepExecutionIteratorImpl$applyAll to Future<List<Void>> instead of Future<List<StepExecution>> Avoid memory leak by changing return value of StepExecutionIteratorImpl.applyAll to Future<List<Void>> instead of Future<List<StepExecution>> Aug 7, 2024
@dwnusbaum dwnusbaum changed the title Avoid memory leak by changing return value of StepExecutionIteratorImpl.applyAll to Future<List<Void>> instead of Future<List<StepExecution>> Prevent StepExecutionIterator from leaking memory in cases where a single processed build has a stuck CPS VM thread Aug 9, 2024
@dwnusbaum dwnusbaum changed the title Prevent StepExecutionIterator from leaking memory in cases where a single processed build has a stuck CPS VM thread Prevent StepExecutionIterator from leaking memory in cases where a single processed execution has a stuck CPS VM thread Aug 9, 2024
@dwnusbaum dwnusbaum added bug and removed tests labels Aug 9, 2024
notStuckBuild = null; // Clear out the local variable in this thread.
Jenkins.get().getQueue().clearLeftItems(); // Otherwise we'd have to wait 5 minutes for the cache to be cleared.
// Make sure that the reference can be GC'd.
MemoryAssert.assertGC(notStuckBuildRef, true);
Copy link
Member Author

@dwnusbaum dwnusbaum Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the fix and this will fail. The PR description shows the reference path preventing the build from being cleaned up.

@dwnusbaum dwnusbaum marked this pull request as ready for review August 9, 2024 17:26
@dwnusbaum dwnusbaum requested a review from a team as a code owner August 9, 2024 17:26
@dwnusbaum dwnusbaum requested a review from jglick August 9, 2024 17:26
}
return null;
}, MoreExecutors.directExecutor());
ListenableFuture<Void> resultsWithWarningsLogged = Futures.catching(results, Throwable.class, t -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. Project Loom would be very welcome in code like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking more carefully through Guava's Javadoc I think I can use FluentFuture to simplify this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually FluentFuture doesn't really make things any clearer, so I will leave it.

var stuck = r.createProject(WorkflowJob.class, "stuck");
stuck.setDefinition(new CpsFlowDefinition("blockSynchronously 'stuck'", false));
var stuckBuild = stuck.scheduleBuild2(0).waitForStart();
await().atMost(5, TimeUnit.SECONDS).until(() -> SynchronousBlockingStep.isStarted("stuck"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or waitForMessage would probably suffice.

await().atMost(5, TimeUnit.SECONDS).until(() -> SynchronousBlockingStep.isStarted("stuck"));
// Make FlowExecutionList$StepExecutionIteratorImpl.applyAll submit a task to the CpsVmExecutorService
// for stuck #1 that will never complete, so the resulting future will never complete.
StepExecution.applyAll(e -> null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return value be kept in a local variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary for the leak, if that's what you mean. Even if the StepExecution.applyAll caller completely ignores the return value, the stuck CPS VM thread holds a strong reference to the aggregate future because the SingleLaneExecutorService's task queue has a reference to the SettableFuture for the getCurrentExecutions call, and then that future references the aggregate future via a listener just due to the implementation of Futures.allAsList.

@jglick jglick merged commit c211223 into jenkinsci:master Aug 9, 2024
16 checks passed
@dwnusbaum dwnusbaum deleted the infinite-loop-memory-leak branch August 9, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants